arXiv: 1506.01192v3 [cs.CL] 22 Nov 2016 


PERSONALIZING UNIVERSAL RECURRENT NEURAL NETWORK LANGUAGE MODEL 
WITH USER CHARACTERISTIC FEATURES BY SOCIAL NETWORK CROWDSOURCING 


Bo-Hsiang Tseng, Hung-yi Lee, and Lin-Shan Lee 

National Taiwan University 
r02942037@ntu.edu.tw, lslee@gate.sinica.edu.tw 


ABSTRACT 

With the popularity of mobile devices, personalized speech
recognizers have become more realizable and highly attractive.
Each mobile device is primarily used by a single user, so it is
possible to have a personalized recognizer that matches the
characteristics of the individual user well. Although acoustic
model personalization has been investigated for decades, much
less work has been reported on personalizing language models,
probably because of the difficulty of collecting enough
personalized corpora. Previous work used corpora collected from
social networks to solve the problem, but constructing a
personalized model for each user is troublesome. In this paper,
we propose a universal recurrent neural network language model
with user characteristic features, so that all users share the
same model, each with different user characteristic features.
These user characteristic features can be obtained by
crowdsourcing over social networks, which include huge
quantities of texts posted by users with known friend
relationships, who may share some subject topics and wording
patterns. Preliminary experiments on a Facebook corpus showed
that the proposed approach not only drastically reduced the
model perplexity, but also offered very good improvement in
recognition accuracy in n-best rescoring tests. This approach
also mitigated the data sparseness problem for personalized
language models.

Index Terms — Recurrent Neural Network, Personalized 
Language Modeling, Social Network, LM adaptation 

1. INTRODUCTION 

The personalization of various applications and services for
each individual user has been a major trend. Good examples
include personalized web search [1][2] and personalized
recommendation systems [3][4][5][6]. In the area of speech
recognition, the popularity of mobile devices such as smart
phones and wearable clients makes personalized recognizers much
more realizable and highly attractive. Each mobile device is
used primarily by a single user, and can be connected to a
personalized recognizer stored in the cloud with much better
performance, because this recognizer can be well-matched to
the linguistic characteristics of the individual user.

In acoustic model adaptation [7][8][9], personalization has
been investigated for decades and has yielded very impressive
improvements with many approaches based on either HMM/GMM or
CD-DNN-HMM [10]. However, there has been much less work
reported on language model (LM) personalization. LM adaptation
has been studied for decades [11][12][13], but the previous
works [14][15][16][17][18] primarily focused on the problem of
cross-domain or cross-genre linguistic mismatch, while the
cross-individual linguistic mismatch is often ignored. One good
reason for this is perhaps the difficulty in collecting
personalized corpora for personalized LMs. However, this
situation has changed in recent years. Nowadays, many
individuals post large quantities of texts over social
networks, which yield huge quantities of posted texts with
known authors and given friend relationships among the authors.
It is therefore possible to train personalized LMs because of
the reasonable assumption that users with close friend
relationships may share common subject topics, wording habits,
and linguistic patterns.

Personalized LMs are useful in many aspects [19][20][21].
In the area of speech recognition, personalization of LMs
has been proposed and investigated for both N-gram-based
LMs [22] and recurrent neural network LMs (RNNLMs) [23] in
the very limited previous works. In these previous works, text
posted by many individual users and other information (such
as friend relationships among users) were collected from
social networks. A background LM (either N-gram-based or
RNN-based) was then adapted toward an individual user’s
wording patterns by incorporating social texts that the target
user and other users had posted, considering different aspects
of their relationships and similarities between the users. In
these previous works, personalization was realized by training
an LM for each individual. There are inevitable shortcomings
with this framework. First, even with the help of social
networks, it is not easy to obtain text corpora that are
helpful for adapting a background LM towards a personalized LM
for a particular user. As a result, the personalized LM thus
obtained easily overfits to the limited data, and therefore
yields relatively poor performance on new data of the target
user. Second, training and storing a personalized LM for every
user is in any case time-consuming and memory-intensive,
especially considering that the number of users will only
increase in the future.


Considering the above-mentioned defects in the previous
framework, in this paper we propose a new RNNLM-based paradigm
for personalizing LMs. In the conventional RNNLM [24][25][26],
the 1-of-N encoding of each word is taken as the input of the
RNN, and then, given the history word sequence, the RNN outputs
the estimated probability distribution for the next word. In
the new paradigm proposed here, however, each user is
represented by a feature vector encoding some characteristics
of the user, and this feature vector augments the 1-of-N
encoding feature of each word. A universal RNNLM is thus
trained based on these user features, together with the texts
posted over social networks by a large number of users. The
standard training method is used, except that the same words
produced by different users in the training set are now
augmented by different user characteristic features. For each
new user, his characteristic feature is extracted to extend the
1-of-N word encoding, with which the universal RNNLM can be
used to recognize his speech. Because the same words produced
by different users are augmented with different features, given
the same history word sequence, the universal RNNLM can predict
different distributions of the next word for different users.
In this way, personalization can be achieved even though all
users share the same universal RNNLM. This universal RNNLM
trained from the social text produced by many users is less
liable to overfitting because a very large training set can be
obtained by aggregating the social texts of many users.
Moreover, since the recognizer for each user only requires the
user’s characteristic features rather than an entirely new
model, the new paradigm saves time during training and memory
in real-world implementations. This concept of input features
for personalization is similar to the i-vectors used in deep
neural network (DNN) based acoustic models [27][28], in which
the i-vector of each speaker is used to extend acoustic
features such as MFCC. Preliminary experiments show that the
proposed method not only reduces model perplexity but also
reduces word error rates in n-best rescoring tests. In
addition, we find that this approach mitigates the overfitting
problem, since even limited personalized data can be helpful in
extracting the target user’s characteristic features.

2. LM PERSONALIZATION SCENARIO 

Crowdsourcing [29][30] has varying definitions and has been
applied to a wide variety of tasks. For example, a
crowdsourcing approach was proposed to collect queries for
information retrieval considering temporal information [31].
The MIT movie browser [32][33] built a crowd-supervised spoken
language system. In this work, a cloud-based application was
implemented that allows users to access their social network
via voice, and it was treated as a crowdsourcing platform for
collecting personal data. When the user logs into his Facebook
account, he may choose to grant this application the authority
to collect his acoustic and linguistic data for use in
personalizing the voice access service. Users who do so may
enjoy the benefits of the superior recognition accuracy yielded
by the personalized recognizer via the crawled data.

Fig. 1: The scenario for the proposed approach. When training
with sentence i from user A, the user feature fed into the
RNNLM can be produced either from the topic distribution of
user A’s personal corpus or by searching over user A’s
personal / friends corpora for sentences with topic
distributions closest to this sentence i.

Fig. 1 depicts the scenario of the proposed approach. For
user A, the red figure in the left part of the figure, the
texts of his social network posts are crawled to form the
personal corpus of the user (the red circle). In addition, the
posts of all of user A’s friends (blue figures) in the social
network are collected to form user A’s friends corpora (the red
cloud surrounding the circle). In previous work [23], the
user’s personal corpus and friends corpora were used for
adapting a background LM trained with a large background
corpus. However, such LM adaptation suffers from overfitting
due to the limited adaptation data, and as mentioned above,
also incurs heavy training/memory burdens.

In this paper, instead of building a personalized RNNLM
for each user, a single universal RNNLM is used by all users.
As shown in the right part of Fig. 1, a corpus of posts from a
large group of users serves as the training data for the
universal RNNLM. This universal RNNLM comprises three layers:
the input layer, the hidden layer, and the output layer, very
similar to those used previously [24], except that the input
layer is not only the word vector w(t) representing the t-th
word in a sentence using a 1-of-N encoding, but is also
concatenated with the additional user characteristic feature f.
This user characteristic feature is connected to both the
hidden layer s(t) and the output layer y(t)¹. This feature f
enables the model to take into account each specific user. The
network weights to be learned are the matrices W, F, S, G and O
in the right part of the figure.
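
As a rough illustration of how the feature f enters the network,
the sketch below runs one forward step of such a model in plain
Python. The roles given to the matrices (W: word input to hidden,
S: recurrent hidden to hidden, F: feature to hidden, O: hidden to
output, G: feature to output) are an assumption made for this
sketch, as are the sigmoid and softmax nonlinearities taken from
the standard RNNLM; the function names are hypothetical.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    e = [math.exp(x - m) for x in v]
    z = sum(e)
    return [x / z for x in e]

def rnnlm_step(w, s_prev, f, W, S, F, O, G):
    """One time step of a feature-augmented RNNLM (assumed layout).

    w      : 1-of-N encoding of the current word
    s_prev : hidden state s(t-1)
    f      : user characteristic feature
    Returns (s(t), y(t)), where y(t) is the next-word distribution.
    """
    # The hidden layer sees the word, the previous state, and the feature.
    hidden_in = [a + b + c for a, b, c in
                 zip(matvec(W, w), matvec(S, s_prev), matvec(F, f))]
    s = sigmoid(hidden_in)
    # The output layer sees the hidden state and, directly, the feature.
    out_in = [a + b for a, b in zip(matvec(O, s), matvec(G, f))]
    return s, softmax(out_in)
```

With the same word and history but two different feature vectors,
the returned distributions differ, which is the mechanism by which
a shared model can behave differently for different users.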

3. EXTRACTION OF USER CHARACTERISTIC 
FEATURES 

We propose two approaches to extract the user characteristic
feature for each sentence, which are described in Subsections
3.1 and 3.2, respectively.

¹This structure is parallel to the context dependent RNNLM
variant [26], except that the context feature in the input
layer is replaced by the user characteristic feature f.
























3.1. User-dependent Feature 

In this approach, the personal corpus for each target user is
viewed as a single document, and a topic modeling approach is
used to derive the topic distribution of that document. The
topic distribution of the personal corpus thus represents the
language characteristics of the user and is taken as the user
characteristic feature f of the user. That is, when training
the universal RNNLM, the 1-of-N encodings of the words in a
personal corpus are all concatenated with the same topic
distribution of that personal corpus. The topic model used here
is a Latent Dirichlet Allocation (LDA) [34] model trained on a
large corpus from many users.
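
The concatenation described above can be sketched as follows. The
topic distribution of the user is assumed to have been inferred
beforehand by the LDA model; the helper names are hypothetical and
the vectors are kept as plain Python lists for readability.

```python
def one_hot(index, size):
    """1-of-N encoding of a word given its vocabulary index."""
    v = [0.0] * size
    v[index] = 1.0
    return v

def augment_with_user_feature(word_ids, vocab_size, user_topics):
    """Return one network input per word: [1-of-N encoding ; user feature].

    Every word of the same user receives the same topic distribution,
    which is what makes this feature user-dependent.
    """
    return [one_hot(i, vocab_size) + list(user_topics) for i in word_ids]
```

For example, with a vocabulary of 4 words and 2 topics, each input
vector has 6 dimensions, and the last 2 are identical for all words
of the same user.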


3.2. Sentence-dependent Feature 

The personal corpus of a user may cover many different topics,
and the user may switch dynamically and freely from one topic
to another within it, so the topic distribution for the whole
personal corpus may not represent each individual sentence in
the corpus very well. On the other hand, even though the topic
can be switched freely in the personal corpus of a user, we
observe that it usually takes at least a few sentences to
finish a specific topic. Therefore, to form a feature
reflecting not only the characteristics of the user but also
those of a specific sentence, we can exploit the part of the
personal corpus whose topic distribution is close to that of
the sentence. This may solve the problem of mismatch between
the topic distribution of the whole personal corpus and each
individual training sentence.

With the above consideration, in the second approach, every
sentence in the personal corpus of a user has its own feature f,
which is related not only to the user but also to the sentence
itself. In other words, the topic model is first used to infer
the topic distribution of a sentence. We then use this topic
distribution to search over the personal corpus of the user to
find the N other sentences whose topic distributions are
closest to that of the sentence being considered. This search
process is fast since it is limited to the personal corpus of
the considered user only. When training the universal RNNLM,
the average of the topic distributions of these N sentences is
taken as the user characteristic feature f to be concatenated
with the 1-of-N encoding features of the words in the sentence.
Therefore, the same words in different sentences of a personal
corpus may have different user characteristic features. We can
also extend the search space to cover the friends corpora of
the user as well.
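
A minimal sketch of this search-and-average step is given below.
The text does not specify how closeness between topic distributions
is measured, so cosine similarity is used here purely as an
assumption; the function names are hypothetical, and the topic
distributions are assumed to come from the LDA model.

```python
import math

def cosine(p, q):
    """Cosine similarity between two topic distributions."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm > 0 else 0.0

def sentence_dependent_feature(query_topics, corpus_topics, n):
    """Average the topic distributions of the n closest corpus sentences.

    query_topics  : topic distribution of the sentence being considered
    corpus_topics : topic distributions of the other sentences in the
                    search space (personal corpus, optionally friends corpora)
    """
    ranked = sorted(corpus_topics,
                    key=lambda t: cosine(query_topics, t),
                    reverse=True)
    nearest = ranked[:n]
    if not nearest:
        raise ValueError("search space is empty")
    dim = len(query_topics)
    return [sum(t[k] for t in nearest) / len(nearest) for k in range(dim)]
```

With n = 1 the feature is simply the topic distribution of the
single closest sentence.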

The major difference between the two approaches in Subsections
3.1 and 3.2 lies in the concept of how a better language model
can be obtained. In the first approach, we assume the personal
corpus of a user reflects his language characteristics, so the
data for inferring the topic distribution is the whole personal
corpus. In the second approach, we assume a user actually
switches his topic freely from sentence to sentence, so we try
to find similar sentences to construct a user characteristic
feature that reflects the language characteristics not only of
the user but also of the specific sentence itself. The data
used to form the user characteristic feature is therefore
limited to the N sentences found in the search process. During
testing, the user characteristic feature is obtained in exactly
the same way, except that the N-best list of an utterance is
used with the LDA model to generate the topic distribution for
the utterance.


4. EFFECTS OF THE USER CHARACTERISTIC 
FEATURE ON RNNLM 

Here we use a real example from the Facebook data to show
the effect of the user characteristic features on RNNLM. User
A left many posts about coffee in the Facebook data, while
user B never did so. This yielded very different user
characteristic features for the two users. Here the user
characteristic features described in Subsection 3.2, obtained
by searching for the N closest sentences, are used. Given the
sentence “A bottle of milk can make 3 cups of latte”, which was
more likely to be produced by user A, we list in Table 1 the
perplexities evaluated by a conventional RNNLM and by the
personalized RNNLM with different user characteristic features.
The conventional RNNLM is in row (a). We see that the
personalized RNNLM with the user characteristic feature f_A of
user A produced a drastically decreased perplexity (152 vs.
355, row (b)) because of the well-matched characteristics,
while that with the user characteristic feature f_B of user B
yielded a significantly increased perplexity (604 vs. 355,
row (c)).


Language Models                 Perplexity
(a) RNNLM (conventional)        355
(b) RNNLM (with f_A)            152
(c) RNNLM (with f_B)            604

Table 1: The perplexity for the sentence “A bottle of milk can
make 3 cups of latte” using different models, where user A’s
personal corpus included many posts about coffee but user
B’s personal corpus contained no such posts.
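
For reference, the perplexity of a sentence follows from the
per-word probabilities assigned by the language model. The sketch
below uses the standard definition, PPL = exp(-(1/T) * sum of log
P(w_t | history)); the function name and the toy probabilities in
the usage note are illustrative assumptions, not values from the
experiments.

```python
import math

def perplexity(word_probs):
    """Perplexity of a word sequence from its per-word probabilities.

    word_probs[t] is the probability the model assigned to the t-th
    word given its history; every value must be in (0, 1].
    """
    if not word_probs:
        raise ValueError("empty sequence")
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))
```

A model that assigns probability 0.25 to every word of a sentence
has a perplexity of 4 on it; a user feature that raises the
probabilities of the words actually used lowers this value.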

5. EXPERIMENTAL SETUP 
5.1. Corpus & LMs 

Our experiments were conducted on a crawled Facebook
corpus. A total of 42 users logged in and authorized this
project to collect their posts and basic information for
research. These 42 users were our target users, and were
divided into 3 groups for cross validation, i.e., the universal
LM was trained using the data of two groups and tested on the
remaining group. Furthermore, with their consent, the
observable public data (the personal and friends corpora) of
these 42 target users were also available for the experiments.
This resulted in the personal data of 93,000 anonymous people
and a total of 2.4 million sentences. The number of sentences
for each user among the 93,000 ranged from 1 to 8,566 with a
mean of 25.7, with 10.6 words (Chinese, English, or mixed) per
sentence on average. A total of 12,000 sentences for the 42
target users were taken as the testing set, and among them 948
sentences produced by the respective target users were taken as
testing utterances for ASR experiments.

For the background corpus, 500k sentences were collected
from the popular social networking site Plurk to train the
topic model. Using the Mallet toolkit [35], we trained a latent
Dirichlet allocation (LDA) topic model, taking each sentence as
a document. The modified Kneser-Ney algorithm [36] was used for
N-gram LM smoothing. From the corpus, the most frequent 18,000
English words and 46,000 Chinese words were selected to form
the lexicon. The SRILM toolkit [37] was used for N-gram LM
training and adaptation, while the RNNLM toolkit was used for
the RNNLMs here.

5.2. N-best rescoring 

To generate the 1,000-best lists for rescoring, we used
lattices produced using the HTK toolkit [39]. To generate the
lattices we used a trigram LM adapted to the personal and
friends corpora using Kneser-Ney smoothing (KN3). For
first-pass decoding we used Mandarin triphone models trained on
the ASTMIC corpus and English triphone models trained on the
Sinica Taiwan English corpus [40]; both corpora include
hundreds of speakers. Both sets of models were adapted using
unsupervised MLLR.
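
The rescoring step can be sketched as below. How the first-pass
score and the RNNLM score are combined is not stated here, so the
weighted sum of log scores is an assumption, and `lm_logprob`,
`lm_weight`, and the hypothesis format are hypothetical names
introduced only for illustration.

```python
def rescore_nbest(nbest, lm_logprob, lm_weight=1.0):
    """Pick the best hypothesis from an N-best list after LM rescoring.

    nbest      : list of (words, acoustic_logscore) pairs
    lm_logprob : callable mapping a word list to an LM log-probability,
                 e.g. a personalized RNNLM evaluated with the user's feature
    lm_weight  : assumed scaling factor for the LM score
    """
    def total_score(hyp):
        words, acoustic_logscore = hyp
        return acoustic_logscore + lm_weight * lm_logprob(words)
    return max(nbest, key=total_score)
```

In this sketch a hypothesis with a slightly worse acoustic score
can still be selected if the language model considers it much more
likely for the user.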


6. EXPERIMENTAL RESULTS 


6.1. Extraction of user characteristic features 


As mentioned in Subsection 3.2, only the N sentences closest
to the sentence under consideration were used to build the user
characteristic feature. Fig. 2 shows the perplexities for
different N (out of the user plus friends corpora) and
different numbers of topics for LDA². The figure shows that
there was almost no difference between N = 1 and N = 2, but as
N increased beyond 2 the perplexity also increased, suggesting
a wide variety of topics even for the same user and his
friends. We thus chose N = 1 for the following experiments.


6.2. Perplexity 

Table 2 shows the perplexity (PPL) results. The personalized
Kneser-Ney tri-gram [22] is reported in section (a), where ‘B’,
‘B+P’, and ‘B+P+F’ indicate respectively the background corpus
only (B), background plus personal corpus (B+P), and background
plus personal plus friends corpora (B+P+F). Row (b) is the
RNNLM using only the background corpus (‘B’) without any
personalization, with hidden layer sizes of 50 and 200.
Personalized RNNLMs based on model adaptation (model) and on
the user characteristic feature (UCF) approach proposed here
are labeled ‘RNN/model’ in section (c) and ‘RNN/UCF’ in section
(d), respectively. In section (d), the notations ‘UD’ and ‘SD’
indicate respectively the user-dependent (Subsection 3.1) and
sentence-dependent (Subsection 3.2) features in the proposed
approach.

Fig. 2: Perplexities for different numbers of LDA topics and
different numbers of similar sentences (N) selected to build
the user characteristic feature in Subsection 3.2.

We found that the sentence-dependent feature outperformed
the user-dependent feature ((d-2) vs. (d-1)), which implies
that the consideration of topic switching in Subsection 3.2 is
reasonable. Under the condition involving personal corpora
(B+P), whether user-dependent or sentence-dependent features
are used, the approach proposed here is always better than the
model adaptation approach ((d-1), (d-2) vs. (c-1)). With the
sentence-dependent features, the PPL improvement is up to 102
(218 in (d-2) vs. 320 in (c-1) for h200). This means that
extracting a good feature to characterize the user is more
efficient than using personal data to learn a personalized
RNNLM. With the friends corpora involved (B+P+F)³, the proposed
approach is still better than the model adaptation approach
(211 in (d-3) vs. 265 in (c-2)). When using the
sentence-dependent feature, we may further average the found
user characteristic feature with the topic distribution of the
sentence being considered (RNN/UCF, SD, B+P+F, avg in (d-4)).
PPL in this case can be further improved (165 in (d-4) vs. 211
in (d-3)). With this best model (RNN/UCF, SD, B+P+F, avg), the
perplexity is reduced by 58.5% compared to RNNLM without
personalization (RNN, B in (b)), and by 37.7% compared to the
model adaptation approach with friends corpora used (RNN/model,
B+P+F in (c-2)).
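
The relative reductions quoted above can be checked directly from
the h200 perplexities of rows (b), (c-2), and (d-4). The short
sketch below only reproduces this arithmetic; the helper name is
hypothetical.

```python
def relative_reduction(baseline, new):
    """Relative reduction of `new` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - new) / baseline

# h200 perplexities quoted in the text for rows (b), (c-2) and (d-4).
ppl_b, ppl_c2, ppl_d4 = 398, 265, 165

print(round(relative_reduction(ppl_b, ppl_d4), 1))   # 58.5
print(round(relative_reduction(ppl_c2, ppl_d4), 1))  # 37.7
```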


6.3. Word error rate (WER) 

Table 3 reports the word error rates (WER) with the same
notation as in Table 2. Section (a) is for the three tri-gram
LMs without and with personalization. As expected, with more
adaptation data, the tri-gram LMs performed better
((a-3) < (a-2) < (a-1)) [22]. We used the best adapted tri-gram
LM (KN3, B+P+F in (a-3)) to generate 1000-best lists for
RNNLM rescoring. Section (b) is for rescoring results using
RNNLM without personalization, while sections (c) and (d)


²Only one-tenth of the personal and friends corpora was used in
these preliminary experiments.

³When extracting the sentence-dependent feature, the search
space is over both personal and friends corpora.







                                        Perplexity
                                        h50     h200
(a)  (a-1) KN3, B                           343
     (a-2) KN3, B+P                         299
     (a-3) KN3, B+P+F                       233
(b)        RNN, B                       441     398
(c)  (c-1) RNN/model, B+P               350     320
     (c-2) RNN/model, B+P+F             296     265
(d)  (d-1) RNN/UCF, UD, B+P             313     270
     (d-2) RNN/UCF, SD, B+P             269     218
     (d-3) RNN/UCF, SD, B+P+F           229     211
     (d-4) RNN/UCF, SD, B+P+F, avg      192     165

Table 2: Perplexity (PPL) results. KN3 represents the
Kneser-Ney tri-gram, while ‘RNN/model’ and ‘RNN/UCF’ are for
personalized RNNLMs based on model adaptation (model) and the
user characteristic feature (UCF) approach proposed here,
respectively. The notations ‘B’, ‘B+P’, and ‘B+P+F’ indicate
respectively using only the background corpus (B), adding the
personal corpus (B+P), and adding the friends corpora as well
(B+P+F). The notations ‘UD’ and ‘SD’ indicate respectively the
user-dependent and sentence-dependent features in the proposed
approach. The results for RNNLMs with hidden layer sizes of
50 and 200 are listed.


are for the model adaptation (model) approach and the proposed
approach (UCF), respectively. For sentence-dependent (SD)
features, we viewed the 1000-best list as a single document
and used the LDA topic model to infer its topic distribution,
and then searched for the closest sentences as described in
Subsection 3.2 to construct the sentence-dependent feature of
each utterance for rescoring. The user-dependent (UD) feature
in the proposed approach is extracted from the personal corpus
of the user and is independent of the input utterance, so its
extraction does not depend on ASR. Regardless of the features
used and the data involved, the proposed approach was always
better than the model adaptation approach ((d-1, 2) vs. (c-1)
and (d-3, 4) vs. (c-2)).

To our surprise, for 200 hidden layer units the proposed
approach with the user-dependent feature is better than with
the sentence-dependent feature in terms of WER ((d-1) vs.
(d-2, 3, 4) for h200). This may be because the user-dependent
feature is estimated from the training corpus of the target
user and is thus not influenced by ASR errors at all, while for
the sentence-dependent feature the topic distribution from the
N-best list was inaccurate due to ASR errors. This was verified
in the oracle experiments in section (e), in which we used the
topic distribution of the reference transcription of the
utterance to replace the topic distribution of the N-best list
for rescoring⁴. Here we see that the sentence-dependent feature
(SD) is better than the user-dependent feature (UD) ((e-2) vs.
(e-1)), while (e-3) and (e-4) are even better. Also, with topic
distributions from the reference transcriptions, the results of
the sentence-dependent feature can be improved by an absolute
0.81% (from 40.26% in (d-4) to 39.45% in (e-4) for h200).

⁴In the oracle experiments in section (e), the results of the
user-dependent feature (e-1) were the same as those in (d-1)
because ASR was not involved in extracting the feature.

                                        WER (%)
                                        h50      h200
(a)  (a-1) KN3, B                           43.80
     (a-2) KN3, B+P                         43.39
     (a-3) KN3, B+P+F                       41.95
(b)        RNN, B                       41.12    41.14
(c)  (c-1) RNN/model, B+P               40.84    40.87
     (c-2) RNN/model, B+P+F             40.71    40.68
(d)  (d-1) RNN/UCF, UD, B+P             40.48    40.16
     (d-2) RNN/UCF, SD, B+P             40.47    40.36
     (d-3) RNN/UCF, SD, B+P+F           40.43    40.40
     (d-4) RNN/UCF, SD, B+P+F, avg      40.23    40.26
(e)  Oracle
     (e-1) RNN/UCF, UD, B+P             40.48    40.16
     (e-2) RNN/UCF, SD, B+P             40.15    40.09
     (e-3) RNN/UCF, SD, B+P+F           40.03    39.95
     (e-4) RNN/UCF, SD, B+P+F, avg      39.40    39.45

Table 3: Word error rate (WER) results with the same notations
as in Table 2. For sentence-dependent (SD) features, the topic
distributions are estimated from N-best lists in section (d),
and from reference transcriptions in section (e) (oracle).

So for the best result in the realistic (non-oracle) setting
(RNN/UCF, UD, B+P for h200 in (d-1)), we reduced WER by an
absolute 1.79% compared to the best KN3 model including friends
corpora (41.95% in (a-3)), from which the 1000-best lists were
obtained, by 0.98% compared to RNNLM without personalization
(41.14% in (b)), and by 0.52% compared to the best of the model
adaptation approach (40.68% in (c-2)). In the oracle case the
best result is considerably better (39.45%, RNN/UCF, SD, B+P+F,
avg in (e-4)), which indicates room for further improvement.
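
For readers unfamiliar with the metric, WER is commonly computed as
the word-level edit distance between the reference and the
recognized hypothesis, divided by the reference length. The sketch
below follows this standard definition; it is not the scoring tool
used in the experiments, and the tokenization of mixed
Chinese/English text is left to the caller as an assumption.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent between two token lists.

    WER = (substitutions + deletions + insertions) / len(reference),
    computed with the standard Levenshtein dynamic program.
    """
    if not reference:
        raise ValueError("reference must not be empty")
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j].
    prev = list(range(len(hypothesis) + 1))
    for i in range(1, len(reference) + 1):
        cur = [i] + [0] * len(hypothesis)
        for j in range(1, len(hypothesis) + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # match or substitution
        prev = cur
    return 100.0 * prev[len(hypothesis)] / len(reference)
```

For instance, recognizing "a bottle milk" for the reference
"a bottle of milk" is one deletion out of four reference words,
i.e. a WER of 25%.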


6.4. Analysis 

6.4.1. WER over all target users 

Because the average does not tell whether the proposed
approach is actually helpful for most users or for just very
few users, we plot in addition the WER changes obtained across
all 42 target users in Fig. 3. The three figures in the upper
row compare respectively the proposed approach with the
user-dependent feature (RNN/UCF, UD, B+P, in (d-1) of Table 3),
the sentence-dependent feature (RNN/UCF, SD, B+P+F, avg, in
(d-4) of Table 3), and the sentence-dependent feature in oracle
experiments (RNN/UCF, SD, B+P+F, avg, in (e-4)) with the
baseline of RNNLM without personalization (row (b) in Table 3),
and the three figures in the lower row compare the same three
approaches with the model adaptation approach (row (c-1) in
Table 3). Each figure has 42 bars for the 42 target users,
sorted by WER change. Here a negative value means that the
proposed approach offered a WER reduction to the user. From
Fig. 3 we see that the proposed approach offered better
performance to most target users. For example, in the first
figure in the upper row ((d-1) RNN/UCF, UD, B+P vs. (b) RNN,
B), 9 users had worse WER with our approach, all by less than
1%, but all other users had a WER reduction, 24 of them by more
than 1%. The remaining cases are similar. The results show that
the proposed approach offered improvements to most target
users.

Fig. 3: WER changes across all 42 target users. The three
figures in the upper row compare respectively the proposed
approach with the user-dependent feature (RNN/UCF, UD, B+P, in
(d-1) of Table 3), the sentence-dependent feature (RNN/UCF, SD,
B+P+F, avg, in (d-4) of Table 3), and the sentence-dependent
feature in oracle experiments (RNN/UCF, SD, B+P+F, avg, in
(e-4)) with the baseline of RNNLM without personalization (row
(b) in Table 3), and the three figures in the lower row compare
the same three approaches with the model adaptation approach
(row (c-1) in Table 3).

6.4.2. Size of personal corpus 

As mentioned above, the model adaptation approach results
in overfitting to the limited personal data and may yield
poor performance on a particular user’s new data. This is
illustrated in Fig. 4. The horizontal axis of the figure is the
fraction of the original personal corpus used, where 1.00
means using the entire original personal corpus, that is, the
cases (c-1) RNN/model, B+P, (d-1) RNN/UCF, UD, B+P, and (d-2)
RNN/UCF, SD, B+P in Tables 2 and 3 for h50. We see that
as less data were available, the proposed approaches (d-1) and
(d-2) showed much smaller increases in perplexity and much more
stable WER, whereas for the model adaptation approach (c-1),
the perplexity and WER increased at a significantly greater
rate. The results for different sizes of friends corpora show
the same trend.

7. CONCLUSIONS 

In this paper, we proposed a new framework for personalizing
a universal RNNLM using data crawled over social networks. The
proposed approach is based on a user characteristic feature
extracted from the user’s personal corpus and friends corpora,
which can be not only user-dependent but also
sentence-dependent. This universal RNNLM can predict different
word distributions for different users given the same context.
Experiments demonstrated good improvements in both perplexity
and WER, and the proposed approach is much more robust to data
sparseness than the previous work.

[Figure 4: plot of PPL (Table 2) and WER (Table 3) against the
percentage of data used (0.25, 0.50, 0.75, 1.00).]

Fig. 4: Perplexity (PPL) and word error rate (WER) for
different sizes of personal corpus for the model adaptation
approach ((c-1) RNN/model, B+P) and the proposed approaches
((d-1) RNN/UCF, UD, B+P and (d-2) RNN/UCF, SD, B+P). The
horizontal axis is the percentage of the original personal
corpora used (1.00 on the right is for the data in Tables 2
and 3).